HW3: Wikipedia Clustering

Authors

  • Shiry Ginosar
  • Luke Segars
Abstract

Our goal for this assignment was to recreate Wikipedia's grouping of articles into semantic categories by means of clustering. Given Wikipedia's XML dump as input, we designed a pipeline of MapReduce jobs to cluster the articles. We then used Wikipedia's own category groupings as ground truth and measured how well our clusters align with it. We have implemented all stages of this pipeline and include working code with our submission. Unfortunately, we were not able to run our scripts to completion because of challenges with cluster availability, so we do not include final results.
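The abstract does not say which measure was used to compare clusters against the category ground truth. As an illustration only, the sketch below assumes purity, one common choice: each cluster is credited with its most frequent ground-truth category, and the score is the fraction of articles covered that way. The function name `purity` and the toy data are hypothetical and do not come from the submission.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, category_labels):
    """Fraction of articles that belong to the majority category of their cluster.

    Assumption: one cluster id and one ground-truth category per article.
    """
    if len(cluster_ids) != len(category_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how often each ground-truth category appears in each cluster.
    counts = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, category_labels):
        counts[cluster][category] += 1

    # Credit each cluster with the size of its largest category.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six articles, three clusters.
clusters = [0, 0, 0, 1, 1, 2]
categories = ["physics", "physics", "biology", "biology", "biology", "history"]
score = purity(clusters, categories)
```

Purity is easy to interpret but rewards having many small clusters, so a measure such as normalized mutual information may be preferable when the number of clusters is not fixed.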

Similar resources

Categorization of Wikipedia Articles with Spectral Clustering

This article reports on the application of clustering algorithms to create hierarchical groups of Wikipedia articles. We evaluate three spectral clustering algorithms on datasets constructed from Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly.


Clustering Document with Active Learning using Wikipedia

Wikipedia has been applied as a background knowledge base to various text mining problems, including document categorization, topic indexing and information extraction. However, very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit Wikipedia and the semantic knowledge therein to facilitate clustering, enabling the automatic grouping of docum...


Clustering of Wikipedia Pages on Edit Behaviors

We use the edit history of Wikipedia to cluster its pages. We conjecture that editors exhibit homophily, or high correlation in their topics of interest. It is therefore possible to use the edit history to cluster pages that share the same or closely related topics. We validate our clustering results with the list of categories and the incoming and outgoing links on...


Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge

In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. A robust and compact document representation is built in real-time using the Wikipedia API. The clustering process is hierarchical and creates cluster labels which are descriptive and important for the examined corpus. Experiments show that the proposed techniqu...


CS294-1 A3: Large-scale Clustering

In this project, we are given the task of clustering Wikipedia articles. As the data is relatively large and cannot be held in memory on a single-node computer, we first adopt a map-reduce dataflow to extract the word counts and build feature matrices. Even with a compact representation of the feature matrix, the clustering task is computationally challenging due to the large number (tens of m...


Publication date: 2012